Abstract


We consider high-dimensional multi-class classification of normal vectors, where unlike
standard assumptions, the number of classes may also be large. We derive the (non-asymptotic)
conditions on effects of significant features, and the lower and upper bounds for distances
between classes required for successful feature selection and classification with a given accuracy.
Furthermore, we study an asymptotic setup where the number of classes is growing with the 
dimension of feature space and sample sizes. To the best of our knowledge, our paper is the first 
to study this important model. In particular, we present an interesting and, at first glance, 
somewhat counter-intuitive phenomenon that the precision of classification can improve as 
the number of classes grows. This is due to more accurate feature selection since even weak 
significant features, which are not sufficiently strong to be manifested in a coarse classification, 
can nevertheless have a strong impact when the number of classes is large. We consider both 
the case of the known and the unknown covariance matrix. The performance of the procedure 
is demonstrated on simulated and real-data examples. 

Keywords: Feature selection; high-dimensionality; misclassification error; multi-class classification; 
sparsity. 

1 Introduction 

Classification has been studied in many contexts. In the era of “big data” one is usually interested 
in classifying objects that are described by a large number of features and belong to many different 
groups. For example, the large hand-labeled ImageNet dataset (http://www.image-net.org/)
contains 10,000,000 labeled images depicting more than 10,000 object categories, where each image,
on average, is represented by 482 × 415 ≈ 200,000 pixels (see Russakovsky et al., 2015 for a
description and discussion of this data set).

The challenge of handling high-dimensional data has been named the “large p small n” type of
problems, which means that the dimensionality of the parameter space p by far exceeds the sample size
n. It is well known that solving problems of this type requires rigorous model selection. In fact,
the results of Bickel and Levina (2004), Fan and Fan (2008) and Shao et al. (2011) demonstrate that
even for the standard case of two classes, classification of high-dimensional normal vectors without
feature selection is as bad as pure random guessing.

Although there exists an enormous amount of literature on classification, most of the existing 
theoretical results have been obtained for the case of two classes (binary classification). See 
Boucheron, Bousquet and Lugosi (2005) and references therein for a comprehensive survey. In
particular, binary classification of high-dimensional sparse Gaussian vectors was considered in Bickel
and Levina (2004), Fan and Fan (2008), Donoho and Jin (2009a,b), Ingster, Pouet and Tsybakov
(2009) and Shao et al. (2011) among others. Tewari and Bartlett (2007) and Pan, Wang and Li
(2016) discuss generalizations of the results for binary classification to multi-class classification. 
They established consistency of the proposed classification procedures but the number of classes 
was assumed to be fixed. 

On the other hand, a significant amount of effort has been spent on designing methods for 
the multi-class classification in statistical and machine learning literature. We can mention here 
techniques designed to adjust pairwise classification to multi-class setting (Escalera et al., 2011; Hill 
and Doucet, 2007; Jain and Kapoor, 2009), adjustment of the support vector machine technique 
to the case of several classes (Crammer and Singer, 2001; Lee, Lin and Wahba, 2004) as well as a 
variety of approaches to expand the linear regression and the neural networks to accommodate the 
multi-category setup (see, e.g., Gupta, Bengio and Weston, 2014). 

However, although a variety of techniques for multi-class classification have been developed,
to the best of our knowledge, so far nobody has studied how the growing number of classes affects
the accuracy of both feature selection and classification. At first glance, successful
classification when the number of classes is large seems close to impossible. On the other
hand, humans have no difficulty in distinguishing between thousands of objects, and the accuracy of
state-of-the-art computer vision techniques is approaching human accuracy. How is this possible?
One of the reasons why multi-class classification succeeds is that selection of appropriate features
from a large sparse p-dimensional vector becomes easier when the number of classes is growing: even
weak significant features that are not sufficiently strong to be manifested in a coarse classification
with a small number of classes may nevertheless have a strong impact as the number of classes
grows. Simulation studies in Davis, Pensky and Crampton (2011) and Parrish and Gupta (2012)
support such a claim. Arias-Castro, Candes and Plan (2011) reported on a similar phenomenon
for testing in the sparse ANOVA model.

This paper is probably the first attempt to rigorously investigate the impact of the number of 
classes on the accuracy of feature selection and classification. In particular, we consider multi-class 
classification of high-dimensional normal vectors based on training samples. We assume that only 
a subset of truly significant features really contribute to separation between classes (sparsity). For 
this reason, we carry out feature selection and assign the new observed vector to the closest class 
w.r.t. the scaled Mahalanobis distance in the space of the selected significant features. 

We start with a non-asymptotic setting and derive the conditions on effects of significant
features, and the lower and upper bounds for distances between classes required for successful
feature selection and classification with a given accuracy. All the results are obtained with
explicit constants. Our finite sample study is followed by an asymptotic analysis for a large number
of features p, where the sample sizes n and, unlike previous works, even the
number of classes L may grow with p. Our findings indicate that having a larger number of classes
aids feature selection and, hence, can improve classification accuracy. On the other hand, a larger
number of classes requires a larger number of significant features $p_1$ for their separation.

The rest of the paper is organized as follows. In Section 2 we present the feature selection and
multi-class classification procedures and derive the non-asymptotic bounds for their accuracy. An
asymptotic analysis is considered in Section 3. Section 4 discusses adaptation of the procedure
to the case of the unknown covariance matrix. In Section 5 we illustrate the performance of the
proposed approach on simulated and real-data examples. Some concluding remarks are summarized
in Section 6. All the proofs are given in the Appendix.

2 Feature selection and classification procedure 

2.1 Notation and preliminaries 

Consider the problem of multi-class classification of p-dimensional normal vectors with L classes:

$$Y_{il} = m_l + \epsilon_{il}, \quad l = 1,\ldots,L; \; i = 1,\ldots,n_l, \qquad (1)$$

where $m_l \in \mathbb{R}^p$ is the vector of mean effects of p features in the l-th class and $\epsilon_{il} \sim N(0_p, \Sigma)$
with the common non-singular covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. To clarify the proposed approach we
assume for now that $\Sigma$ is known and discuss the situation with the unknown $\Sigma$ in Section 4.

After averaging over repeated observations within each class, model (1) yields

$$\bar{Y}_l = m_l + \epsilon^*_l, \quad l = 1,\ldots,L, \qquad (2)$$

where $\epsilon^*_l \sim N(0_p, n_l^{-1}\Sigma)$.

The objective is to assign a new observed feature vector $Y_0 \in \mathbb{R}^p$ to one of the L classes. Denote

$$N = \sum_{l=1}^{L} n_l, \quad \rho_l = n_l/(n_l + 1) \quad \text{and} \quad L_1 = L - 1, \qquad (3)$$

where evidently $1/2 \le \rho_l < 1$.

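Since $\mathrm{Var}(Y_0 - \bar{Y}_l) = \rho_l^{-1}\Sigma$, we assign $Y_0$ to the class $\hat{l}$ with the nearest centroid $\bar{Y}_l$ w.r.t.
the scaled Mahalanobis distance:

$$\hat{l} = \arg\min_{1 \le l \le L} \left\{ \rho_l (Y_0 - \bar{Y}_l)^t \Sigma^{-1} (Y_0 - \bar{Y}_l) \right\}. \qquad (4)$$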
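As an illustration of rule (4), the following Python sketch assigns a new vector to the class that minimizes the scaled Mahalanobis distance. It is only a sketch under stated assumptions: the covariance matrix is known and non-singular, the function and argument names are ours and do not come from the paper, and NumPy is used for the linear algebra.

```python
import numpy as np

def nearest_centroid_mahalanobis(y0, centroids, class_sizes, sigma):
    """Return the index l minimizing rho_l * (y0 - Ybar_l)^t Sigma^{-1} (y0 - Ybar_l)."""
    y0 = np.asarray(y0, dtype=float)                 # new observation, shape (p,)
    centroids = np.asarray(centroids, dtype=float)   # class means Ybar_l, shape (L, p)
    n = np.asarray(class_sizes, dtype=float)         # class sizes n_l, shape (L,)
    rho = n / (n + 1.0)                              # rho_l = n_l / (n_l + 1), as in (3)
    diffs = y0 - centroids                           # row l holds y0 - Ybar_l
    # Solve Sigma x = diff for every class rather than inverting Sigma explicitly.
    solved = np.linalg.solve(np.asarray(sigma, dtype=float), diffs.T).T
    scaled_dist = rho * np.einsum("lp,lp->l", diffs, solved)
    return int(np.argmin(scaled_dist))
```

Note that the factor $\rho_l$ matters only when class sizes differ: with equal unscaled distances, the rule favors the class with the smaller training sample, whose centroid is the noisier one.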

It is well-known (see, e.g., Bickel and Levina, 2004; Fan and Fan, 2008 and Shao et al., 2011)
that the performance of classification procedures worsens as the number of features grows (curse
of dimensionality). Hence, dimensionality reduction by feature selection prior to classification is 
crucial for large values of p. 

Re-write (2) in terms of the one-way multivariate analysis of variance (MANOVA) model as
follows:

$$\bar{Y}_l = \delta + \beta_l + \epsilon^*_l, \quad l = 1,\ldots,L, \qquad (5)$$

where $m_l = \delta + \beta_l$, $\delta$ is the vector of mean main effects of features and $\beta_{lj}$, $j = 1,\ldots,p$, is the
mean interaction effect of the j-th feature with the l-th class, with the standard identifiability conditions
$\sum_{l=1}^{L} \beta_{lj} = 0$ for each $j = 1,\ldots,p$.

The impact of the j-th feature on classification depends on its variability between the different
classes, characterized by the interactions $\beta_{lj}$, $l = 1,\ldots,L$, in the model (5). The larger the
interactions, the stronger the impact of the feature. A natural global measure of a feature’s
contribution to classification is then $b_j^2 = \sum_{l=1}^{L} \beta_{lj}^2$. Note that a feature may still have a strong
main effect $\delta_j$ but its contribution to classification nevertheless remains weak if it does not vary
significantly between classes, that is, if $b_j^2$ is small. The main goal of feature selection is to identify
a sparse subset of significant features for further use in classification.

2.2 Oracle classification 

First, we consider an ideal situation where there is an oracle that provides the list of truly significant
features with $b_j^2 > 0$. In this case, we would obviously use only those features for classification,
thus reducing the dimensionality of the problem. Define indicator variables $\chi_j = I\{b_j^2 > 0\}$, and
let $p_1 = \sum_{j=1}^{p} \chi_j$ and $p_0 = p - p_1$ be, respectively, the numbers of significant and non-significant
features. Without loss of generality, we can always order features in such a way that those $p_1$
significant features are the first ones. The classification procedure (4) then becomes

$$\hat{l} = \arg\min_{1 \le l \le L} \left\{ \rho_l (Y_0^* - \bar{Y}_l^*)^t (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_l^*) \right\}, \qquad (6)$$

where $Y_0^*, \bar{Y}_l^* \in \mathbb{R}^{p_1}$ are the truncated versions of $Y_0$ and $\bar{Y}_l$ respectively: $Y_{0j}^* = Y_{0j}$ and
$\bar{Y}_{lj}^* = \bar{Y}_{lj}$, $j = 1,\ldots,p_1$, and $\Sigma^* \in \mathbb{R}^{p_1 \times p_1}$ is the corresponding upper left sub-matrix of $\Sigma$.

Theorem 1 provides an upper bound for the misclassification error of the oracle classification
procedure (6):

Theorem 1. Consider the model (1) and the equivalent model (5). Let $m_k^* \in \mathbb{R}^{p_1}$, $k = 1,\ldots,L$,
be the truncated versions of class centers and assume that for all pairs of classes

$$(m_k^* - m_{k'}^*)^t (\Sigma^*)^{-1} (m_k^* - m_{k'}^*) \ge \frac{8\ln(L_1/\alpha)}{\min(\rho_k, \rho_{k'})} \left(1 + \sqrt{1 - \rho_k\rho_{k'}}\,\sqrt{1 + \frac{2p_1}{\ln(L_1/\alpha)}}\right) \qquad (7)$$

for some $0 < \alpha < 1$.

Let a new observation $Y_0$ from the class l be assigned to the $\hat{l}$-th class according to classification
rule (6). Then, the misclassification error is

$$P(\hat{l} \ne l) \le \alpha. \qquad (8)$$


Condition (7) requires that classes be sufficiently separated from each other (in terms
of Mahalanobis distance) to achieve the required classification accuracy. Theorem 2 below, which
is a direct consequence of Fano’s lemma for the lower bound of misclassification error (see, e.g.,
Ibragimov and Hasminskii, 1981, Section 7.1), implies that the condition (7) is essentially necessary
for a successful classification and cannot be significantly improved (in the minimax sense).

Theorem 2. Consider the model (1). Let a new observation $Y_0$ be from one of L classes. If

$$\Delta_*^2 = \min_{k \ne k'} (m_k^* - m_{k'}^*)^t (\Sigma^*)^{-1} (m_k^* - m_{k'}^*) \le 2H\ln L_1 \qquad (9)$$

for some $H > 0$, then

$$\inf_{\psi} \max_{1 \le l \le L} P_l(\psi(Y_0) \ne l) \ge 1 - H - \frac{\ln 2}{\ln L_1}, \qquad (10)$$

where $P_l$ is the probability evaluated under the assumption that $Y_0$ belongs to the l-th class, and the
infimum is taken over all classification rules $\psi: Y_0 \mapsto \{1,\ldots,L\}$.


2.3 Feature selection procedure 

Consider now the classification setup in the MANOVA model (5) with a more realistic scenario, where
the set of significant features is unknown and should be identified from the data.

Following our previous arguments, the j-th feature is not significant (irrelevant) for classification
if it has zero interaction effects with all classes, that is, if $\beta_{lj} = 0$, $l = 1,\ldots,L$ or, equivalently,
$b_j^2 = 0$. Then, for each $j = 1,\ldots,p$ we need to test the null hypothesis $H_{0j}: b_j^2 = 0$. An obvious
test statistic is then

$$\zeta_j = \sigma_j^{-2} \sum_{l=1}^{L} n_l (\bar{Y}_{lj} - \bar{Y}_{\cdot j})^2, \qquad (11)$$

where $\sigma_j^2 = \Sigma_{jj}$ and $\bar{Y}_{\cdot j} = N^{-1}\sum_{l=1}^{L} n_l \bar{Y}_{lj}$. Under the null, $\zeta_j \sim \chi^2_{L-1}$, while under the alternative
$\zeta_j \sim \chi^2_{L-1;\mu_j}$, where $\chi^2_{L-1;\mu_j}$ is the non-central chi-square distribution with the non-centrality
parameter $\mu_j = \sigma_j^{-2}\sum_{l=1}^{L} n_l \beta_{lj}^2$. Note that unless $\Sigma$ is diagonal, the $\zeta_j$’s are correlated.

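For a given $0 < \alpha < 1$, define a threshold

$$\lambda = L_1 + 2\sqrt{L_1 \ln(2p/\alpha)} + 2\ln(2p/\alpha) \qquad (12)$$

and select the j-th feature as significant (reject $H_{0j}$) if

$$\sigma_j^{-2}\sum_{l=1}^{L} n_l (\bar{Y}_{lj} - \bar{Y}_{\cdot j})^2 > \lambda. \qquad (13)$$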
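The selection step defined by the threshold (12) and the rule (13) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are ours, the variances are assumed known, and the grand mean of each feature is taken to be the sample-size-weighted average of the class means (our assumption, chosen so that the statistic is the usual one-way ANOVA between-class sum of squares).

```python
import numpy as np

def selection_threshold(L, p, alpha):
    """Threshold (12): L1 + 2*sqrt(L1*ln(2p/alpha)) + 2*ln(2p/alpha), with L1 = L - 1."""
    L1 = L - 1
    x = np.log(2.0 * p / alpha)
    return L1 + 2.0 * np.sqrt(L1 * x) + 2.0 * x

def select_features(class_means, class_sizes, variances, alpha):
    """Return a boolean mask of features whose statistic (11) exceeds the threshold (12)."""
    class_means = np.asarray(class_means, dtype=float)   # Ybar_{lj}, shape (L, p)
    n = np.asarray(class_sizes, dtype=float)             # n_l, shape (L,)
    var = np.asarray(variances, dtype=float)             # sigma_j^2, shape (p,)
    L, p = class_means.shape
    grand_mean = n @ class_means / n.sum()               # assumed weighted grand mean per feature
    zeta = (n[:, None] * (class_means - grand_mean) ** 2).sum(axis=0) / var
    return zeta > selection_threshold(L, p, alpha)
```

For instance, with three classes of 20 observations each and four unit-variance features, a feature whose class means are -5, 0 and 5 gives a statistic of 1000 and is selected, whereas features with identical class means give a statistic of zero and are not.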

The following theorem shows that under certain conditions on the minimal required effect for 
significant features, the proposed feature selection procedure correctly identifies the true (unknown) 
subset of significant features with probability at least 1 — a: 


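Theorem 3. Consider the feature selection procedure (13) with the threshold (12) for some
$0 < \alpha < 1$. Define indicator variables $\hat{\chi}_j = I\{\sigma_j^{-2}\sum_{l=1}^{L} n_l (\bar{Y}_{lj} - \bar{Y}_{\cdot j})^2 > \lambda\}$, $j = 1,\ldots,p$.
Let

$$\mu_* = \min_{1 \le j \le p_1} \sigma_j^{-2} \sum_{l=1}^{L} n_l \beta_{lj}^2 \qquad (14)$$

and assume that for all $p_1$ truly significant features one has

$$\mu_* \ge 4\left(3\ln(2p/\alpha) + \sqrt{L_1 \ln(2p/\alpha)}\right). \qquad (15)$$

Then,

$$P(\hat{\chi} = \chi) \ge 1 - \alpha.$$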


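The condition (15) on the total minimal effect for significant features can be re-formulated in
terms of their average effect per class:

$$\frac{1}{L}\sum_{l=1}^{L} \frac{n_l \beta_{lj}^2}{\sigma_j^2} \ge 4\left(\frac{3\ln(2p/\alpha)}{L} + \sqrt{\frac{\ln(2p/\alpha)}{L}}\right), \quad j = 1,\ldots,p_1. \qquad (16)$$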


Thus, as the number of classes increases, even significant features with weaker effects within each
class become manifested and contribute to classification. The effect of a feature that remains
latent and unnoticed in a coarse classification with a small number of classes may be expressed in a
finer classification.
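To make this point concrete, the short Python sketch below evaluates the right-hand side of condition (15) divided by L, that is, the minimal average effect per class implied by (15), for several numbers of classes. The function names are ours, and the values p = 500 and α = 0.05 are borrowed from the simulation study in Section 5.1 purely for illustration.

```python
import math

def min_total_effect(L, p, alpha):
    """Right-hand side of (15): 4 * (3*ln(2p/alpha) + sqrt(L1*ln(2p/alpha))), L1 = L - 1."""
    x = math.log(2.0 * p / alpha)
    return 4.0 * (3.0 * x + math.sqrt((L - 1) * x))

def min_average_effect_per_class(L, p, alpha):
    """Minimal total effect from (15) averaged over the L classes."""
    return min_total_effect(L, p, alpha) / L

# The required per-class effect shrinks as the number of classes grows.
for L in (2, 10, 20, 50):
    print(L, round(min_average_effect_per_class(L, p=500, alpha=0.05), 3))
```

The total requirement in (15) grows only like $\sqrt{L_1}$, so after dividing by L the per-class requirement decreases with L.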


2.4 Classification rule and misclassification error 

Consider now the classification rule (6), where the unknown true $\chi_j$ are replaced by $\hat{\chi}_j$ following
the proposed feature selection procedure. Let $\hat{p}_1 = \sum_{j=1}^{p} \hat{\chi}_j$ be the number of features declared
significant and $\hat{p}_0 = p - \hat{p}_1$. Again, order the features in such a way that those $\hat{p}_1$ features selected
as significant are the first ones. The resulting classification rule can then be presented as
follows:

$$\hat{l} = \arg\min_{1 \le l \le L} \left\{ \rho_l (Y_0^* - \bar{Y}_l^*)^t (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_l^*) \right\}, \qquad (17)$$

where the truncated vectors $Y_0^*, \bar{Y}_l^* \in \mathbb{R}^{\hat{p}_1}$, $l = 1,\ldots,L$, are defined now as $Y_{0j}^* = Y_{0j}$, $\bar{Y}_{lj}^* = \bar{Y}_{lj}$,
$j = 1,\ldots,\hat{p}_1$, and $\Sigma^* \in \mathbb{R}^{\hat{p}_1 \times \hat{p}_1}$ is the corresponding upper left sub-matrix of $\Sigma$.

We have

$$P(\hat{l} \ne l) \le P(\hat{l} \ne l \mid \hat{\chi} = \chi) + P(\hat{\chi} \ne \chi), \qquad (18)$$

where, by Theorem 1 and Theorem 3, each probability in the RHS of (18) is at most $\alpha$. Thus, the
following result holds:

Theorem 4. Consider the model (1) and the corresponding model (5). Assume the conditions (7)
and (15) hold for some $0 < \alpha < 1/2$. Apply feature selection procedure (13) and use the selected
features for classification via the rule (17). Then,

$$P(\text{correct classification}) \ge 1 - 2\alpha.$$


3 Asymptotic analysis 

Conditions (7) and (15) (or (16)) of Theorems 1 and 3, respectively, provide the non-asymptotic
lower bounds on the minimal distance between different classes and the minimal effect of significant
features required for perfect feature selection and a classification error bounded above by $2\alpha$. In
order to gain a better understanding of these conditions, we consider an asymptotic setup.

Standard asymptotics considered in classification literature assume that the number of features 
p and the sample sizes $n_l$ increase whereas the number of classes L is fixed (see, e.g., Fan and Fan,
2008; Shao et al., 2011 for L = 2 and Pan, Wang and Li, 2016 for a general fixed L). On the
contrary, our study is motivated by the case where the number of classes may also be large. 

Recall that $N = \sum_{k=1}^{L} n_k$ is the total sample size and let the number of features $p \to \infty$.
Assume that all eigenvalues of the $p_1 \times p_1$ covariance matrix of significant features $\Sigma^*$ are finite
and bounded away from zero, i.e., there exist absolute constants $c_1$ and $c_2$ such that

$$0 < c_1 \le \lambda_{\min}(\Sigma^*) \le \lambda_{\max}(\Sigma^*) \le c_2 < \infty. \qquad (19)$$

Assume that the sample sizes $n_l$ within each class also grow with p and, for simplicity of exposition,
are of the same asymptotic order, that is, $n_1 \sim \ldots \sim n_L \sim n$, where $n = N/L$ and $a \sim b$ means
$a = b(1 + o(1))$. In such an asymptotic setup, $\rho_l = n_l/(n_l + 1) \sim 1$, while $\sqrt{1 - \rho_l\rho_k} \sim \sqrt{2/n}$. The
results in the previous section allow one to study various other settings with unequal class sizes as
well, but the analysis of a vast variety of such possible scenarios is beyond the scope of this paper.

Consider now the condition (7) of Theorems 1 and 4 on the minimal separation Mahalanobis
distance between any two class centers as p and n tend to infinity, where the number of significant
features $p_1$ and the number of classes L may increase with p, and $\alpha$ may depend on n, p and L.
Thus, (7) yields:


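$$\min_{k \ne k'} (m_k^* - m_{k'}^*)^t (\Sigma^*)^{-1} (m_k^* - m_{k'}^*) \gtrsim 8\ln(L_1/\alpha)\left(1 + \sqrt{\frac{2}{n}}\,\sqrt{1 + \frac{2p_1}{\ln(L_1/\alpha)}}\right). \qquad (20)$$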


Define

$$\eta_1 = \lim_{p \to \infty} \sqrt{\frac{p_1}{n \ln(L_1/\alpha)}}.$$


Depending on $\eta_1$, the condition (20) implies two possible asymptotic regimes for $\Delta_*^2$:

$$\Delta_*^2 \sim \begin{cases} 8\ln(L_1/\alpha)(1 + \eta_1), & 0 \le \eta_1 < \infty \text{ (sparse regime: small number of significant features)} \\ 8\sqrt{p_1 \ln(L_1/\alpha)/n}, & \eta_1 = \infty \text{ (dense regime: large number of significant features)} \end{cases} \qquad (21)$$

Theorem 2 implies that for $\eta_1 < \infty$ (sparse regime), the condition (20) is also essentially
necessary for successful classification:

Proposition 1. Let $L \to \infty$ and $p_1 \to \infty$ as $p \to \infty$. Let a new observation $Y_0$ be from one of L
classes. If

$$\Delta_*^2 \sim 2\delta_p \ln L_1,$$

where $\delta_p \to 0$ arbitrarily slowly as $p \to \infty$, then

$$\lim_{p \to \infty} \inf_{\psi} \max_{1 \le l \le L} P_l(\psi(Y_0) \ne l) = 1,$$

where $P_l$ is the probability evaluated under the assumption that $Y_0$ belongs to the l-th class, and the
infimum is taken over all classification rules $\psi: Y_0 \mapsto \{1,\ldots,L\}$.

The validity of Proposition 1 follows immediately from Theorem 2.

It is natural that for successful classification the between-class distances should grow with L.
Note, however, that unless the number of classes increases exponentially with $p_1$, the growth rate
of $\Delta_*^2$ is $o(p_1)$ and the corresponding average per-feature distances $p_1^{-1}(m_k^* - m_{k'}^*)^t (\Sigma^*)^{-1} (m_k^* - m_{k'}^*)$
still tend to zero.

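Similarly, from the condition (15) in Theorems 3 and 4 on the minimal required effect for
significant features for perfect feature selection, we have asymptotically

$$b_*^2 = \min_{1 \le j \le p_1} \sigma_j^{-2}\sum_{l=1}^{L} \beta_{lj}^2 \gtrsim \frac{4}{n}\left(3\ln(2p/\alpha) + \sqrt{L_1 \ln(2p/\alpha)}\right).$$

Let

$$\eta_2 = \lim_{p \to \infty} \sqrt{\frac{\ln(2p/\alpha)}{L_1}}.$$

Then,

$$b_*^2 \sim \begin{cases} 4n^{-1}\sqrt{L_1 \ln(2p/\alpha)}\,(1 + 3\eta_2), & 0 \le \eta_2 < \infty \text{ (large number of classes)} \\ 12n^{-1}\ln(2p/\alpha), & \eta_2 = \infty \text{ (small number of classes)} \end{cases} \qquad (22)$$

and the threshold $\lambda$ in (12) for feature selection can be presented as

$$\lambda \sim \begin{cases} L_1(1 + 2\eta_2 + 2\eta_2^2), & 0 \le \eta_2 < \infty \\ 2\ln(2p/\alpha), & \eta_2 = \infty \end{cases}$$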
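The large-L form of the threshold is an exact algebraic rewriting of (12) once the limit in the definition of the ratio is replaced by its finite-sample value $\sqrt{\ln(2p/\alpha)/L_1}$. The following Python check, with function names of our own choosing, compares the two expressions numerically; it is a sanity check of the algebra, not part of the paper's procedure.

```python
import math

def threshold(L1, p, alpha):
    """Threshold (12) written directly."""
    x = math.log(2.0 * p / alpha)
    return L1 + 2.0 * math.sqrt(L1 * x) + 2.0 * x

def threshold_via_eta(L1, p, alpha):
    """Same threshold written as L1 * (1 + 2*eta + 2*eta^2), eta = sqrt(ln(2p/alpha) / L1)."""
    eta = math.sqrt(math.log(2.0 * p / alpha) / L1)
    return L1 * (1.0 + 2.0 * eta + 2.0 * eta ** 2)

# The two forms agree up to floating-point rounding for any L1 >= 1.
for L1 in (1, 9, 49, 999):
    assert math.isclose(threshold(L1, 500, 0.05), threshold_via_eta(L1, 500, 0.05))
```

When the ratio is large (few classes), the term $2\ln(2p/\alpha)$ dominates, which is the second regime of the display above.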


To gain some insight into the minimal required effect for a significant feature to contribute to
classification as the number of classes increases, assume for simplicity that each significant feature
has equal effects on each class, that is, $\beta_{1j}^2 = \ldots = \beta_{Lj}^2 = \beta_j^2$ in (5). Since $0 \le \eta_2 < \infty$ implies that
L is large, so that $L_1 = L - 1 \sim L$, condition (22) yields as $p \to \infty$:

$$\beta_j^2 \gtrsim \begin{cases} 4\sigma_j^2 n^{-1}\eta_2(1 + 3\eta_2), & 0 \le \eta_2 < \infty \text{ (large number of classes)} \\ 12\sigma_j^2 n^{-1} L^{-1} \ln(2p/\alpha), & \eta_2 = \infty \text{ (small number of classes)} \end{cases} \qquad (23)$$


Since $\eta_2$ is decreasing with L for a given value of $\alpha$, the required minimal level for $\beta_j^2$ in the
RHS of (23) decreases as L grows and, therefore, more significant features may become manifested
in classification for a larger number of classes. Thus, while it might be hard to perform coarse
classification with a set of weak features, their impact grows as one considers finer and finer
separation between objects (see also the corresponding remarks at the end of Section 2.3).


In particular, for fixed L (commonly, L = 2) and n = o(p), conditions (20) and (23) are of the
form $\Delta_*^2 \sim C_1$ and $\beta_j^2 \sim C_2 n^{-1}\ln(p/\alpha)$, $C_1, C_2 > 0$, and are similar to those of Fan and Fan
(2008, Theorem 1 and Theorem 3). See also the results of Donoho and Jin (2009a,b) and Ingster,
Pouet and Tsybakov (2009) for closely related setups.


4 Unknown covariance matrix 

So far the covariance matrix $\Sigma$ was assumed to be known. In practice, however, it usually has to be
estimated from the data. The standard MLE

$$\hat{\Sigma} = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{n_l} (Y_{il} - \bar{Y}_l)(Y_{il} - \bar{Y}_l)^t \qquad (24)$$

or the similar unbiased pooled estimator commonly used in MANOVA behave poorly for high-dimensional
data. However, under the sparsity assumption, the proposed classification procedure
requires estimating only the variances $\sigma_j^2$ in the feature selection procedure (11) and the inverse of
the upper left ($p_1 \times p_1$) sub-matrix $\Sigma^*$ of $\Sigma$ in the classification rule (17). Thus, when $p_1 \ll p$, a
low-dimensional matrix $(\hat{\Sigma}^*)^{-1}$ may still be a good estimator of the true sub-matrix $(\Sigma^*)^{-1}$ and
(under some additional mild conditions) may be used instead of the latter in (17).
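For reference, the pooled MLE (24) can be computed as in the Python sketch below. The function name and the data layout (one observation per row, with a parallel array of class labels) are our own choices for illustration; the diagonal of the returned matrix supplies the variance estimates used in feature selection, and its upper left block over the selected features supplies the matrix inverted in the classification rule.

```python
import numpy as np

def pooled_mle_covariance(samples, labels):
    """Pooled within-class MLE of the covariance matrix, as in (24): scatter / N."""
    samples = np.asarray(samples, dtype=float)   # all observations, shape (N, p)
    labels = np.asarray(labels)                  # class label of each row, shape (N,)
    N, p = samples.shape
    scatter = np.zeros((p, p))
    for c in np.unique(labels):
        block = samples[labels == c]             # observations of class c
        centered = block - block.mean(axis=0)    # subtract the class mean Ybar_l
        scatter += centered.T @ centered         # within-class scatter of class c
    return scatter / N                           # divide by N (MLE), not N - L
```

Dividing by N - L instead of N would give the unbiased pooled estimator mentioned above.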
Assume that $p \le \frac{\alpha}{2} e^{(N-L)/4}$. Replace $\sigma_j^2$ in (11) by $\hat{\sigma}_j^2 = \hat{\Sigma}_{jj}$ and consider the feature selection
procedure (13) with a somewhat larger threshold

$$\lambda_1 = \frac{\lambda}{1 - \kappa_{p,N,L,\alpha}}, \qquad (25)$$


where $\lambda$ is the threshold (12) used for the case of known variances and


f^p,N,L,a — 2 


ln(2p/a) + 9 ln(2p/q) < ^ 


(25) 


(26) 


N-L N-L 

The following theorem shows that under slightly stronger conditions on the minimal effect for
significant features, the above feature selection procedure with estimated $\sigma_j^2$ still controls the
probability of correct identification of the true subset of significant features.

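Theorem 5. Let $0 < \alpha < 1/2$ and assume that $p \le \frac{\alpha}{2} e^{(N-L)/4}$.

Define indicator variables

$$\hat{\chi}_j = I\left\{\hat{\sigma}_j^{-2}\sum_{l=1}^{L} n_l (\bar{Y}_{lj} - \bar{Y}_{\cdot j})^2 > \lambda_1\right\}, \quad j = 1,\ldots,p, \qquad (27)$$

with $\lambda_1$ given in (25). Assume that $\mu_*$ in (14) satisfies

$$\mu_* + L_1 - 2\sqrt{(L_1 + 2\mu_*)\ln(2p/\alpha)} \ge \lambda_1 (1 + \kappa_{p,N,L,\alpha}). \qquad (28)$$

Then,

$$P(\hat{\chi} = \chi) \ge 1 - 2\alpha.$$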

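Consider now the classification procedure (17). In what follows we assume that $\hat{\Sigma}^*$ is
non-singular. Let $\hat{\Sigma}^*$ be the corresponding upper left $p_1 \times p_1$ sub-matrix of $\hat{\Sigma}$, i.e.

$$\hat{\Sigma}^* = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{n_l} (Y_{il}^* - \bar{Y}_l^*)(Y_{il}^* - \bar{Y}_l^*)^t, \qquad (29)$$

where $Y_{il}^*$ are the corresponding $p_1$-dimensional truncated versions of $Y_{il}$.

Assign $Y_0$ to the $\hat{l}$-th class by replacing the true (unknown) $(\Sigma^*)^{-1}$ in (17) by $(\hat{\Sigma}^*)^{-1}$:

$$\hat{l} = \arg\min_{1 \le l \le L} \left\{ \rho_l (Y_0^* - \bar{Y}_l^*)^t (\hat{\Sigma}^*)^{-1} (Y_0^* - \bar{Y}_l^*) \right\}. \qquad (30)$$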


Then the following version of Theorem 4 holds:

Theorem 6. Consider the model (1) and the corresponding model (5), where $p \le \frac{\alpha}{2} e^{(N-L)/4}$,


m K (L,21 I1 (^)< pl < i L(W|) 


N 


for some $0 < \alpha < 1/4$, where $C_1$ is an absolute constant specified in the proof. Denote

7 ”" A ' = V (e*)V^T 

minv / T 


(31) 


( 32 ) 
and note that $\gamma_{p_1,N} < 1$ due to (31). Assume the condition (28) and a somewhat stronger version
of the condition (7), namely,

$$(m_k^* - m_{k'}^*)^t (\Sigma^*)^{-1} (m_k^* - m_{k'}^*) \ge \frac{8\ln(L_1/\alpha)}{(1 - \gamma_{p_1,N})\min(\rho_k, \rho_{k'})} \left(1 + \sqrt{1 - (1 - \gamma_{p_1,N})\rho_k\rho_{k'}}\,\sqrt{1 + \frac{2p_1}{\ln(L_1/\alpha)}}\right). \qquad (33)$$

Apply feature selection procedure (27) and use the selected features for classification via the rule
(30). Then,

$$P(\text{correct classification}) \ge 1 - 4\alpha.$$

Theorem 6 shows that for a sparse setup the proposed classification procedure can still be used
when the covariance matrix is unknown and estimated from the data. 

5 Examples 

In this section we demonstrate the performance of the proposed feature selection and classification 
procedure on simulated and real-data examples. 


5.1 Simulation study 

The class means were generated as i.i.d. normal vectors $m_l \sim N(0, \sigma_m^2 X)$, $l = 1,\ldots,L$, where
$X_{p \times p}$ is a diagonal matrix with $X_{ii} = 1$ for $p_1$ indices and $X_{ii} = 0$ for others. Since the vectors
generated in this manner do not necessarily satisfy our assumptions, in order to reduce the
impact of a particular choice of vectors $m_l$, we generated $M_1$ replications of the class means.
Furthermore, following the model (1), for each replication of class means $m_l$, $l = 1,\ldots,L$, we
generated $M_2$ sets of training samples $Y_{lji} = m_{lj} + \epsilon_{lji}$, $j = 1,\ldots,p$; $i = 1,\ldots,n$, where the noise
vectors are i.i.d. $N(0, n^{-1}\Sigma)$. Similarly to Pan, Wang and Li (2016) we considered three choices for
$\Sigma$. In Example 1 features were independent, i.e. $\Sigma = \sigma^2 I_p$. In Example 2 we considered
the autoregressive covariance structure with $\Sigma_{j_1 j_2} = \sigma^2\, 0.5^{|j_1 - j_2|}$, while in Example 3 we set
$\Sigma_{j_1 j_2} = \sigma^2 (0.5 + 0.5\, I\{j_1 = j_2\})$, $j_1, j_2 = 1,\ldots,p$, implying equal variances $\sigma^2$ and all covariances
equal to $\sigma^2/2$ (compound symmetric structure). Finally, for each of the $M_1 \cdot M_2$ sets of training samples,
we drew a test set of $M_3$ new vectors from randomly chosen classes as i.i.d. normal vectors $N(m_l, \Sigma)$.
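The three covariance structures used in Examples 1 to 3 can be generated as in the following Python sketch. The function name and signature are ours; the formulas follow the descriptions above (independent features, autoregressive structure with parameter 0.5, and compound symmetry with all covariances equal to half the variance).

```python
import numpy as np

def example_covariance(example, p, sigma2=1.0):
    """Covariance matrix for simulation Example 1, 2 or 3 with common variance sigma2."""
    if example == 1:
        # Independent features: sigma^2 * I_p.
        return sigma2 * np.eye(p)
    if example == 2:
        # Autoregressive structure: sigma^2 * 0.5^{|j1 - j2|}.
        idx = np.arange(p)
        return sigma2 * 0.5 ** np.abs(idx[:, None] - idx[None, :])
    if example == 3:
        # Compound symmetry: variances sigma^2, all covariances sigma^2 / 2.
        return sigma2 * (0.5 * np.ones((p, p)) + 0.5 * np.eye(p))
    raise ValueError("example must be 1, 2 or 3")
```

All three matrices are positive definite, so they can be passed directly to a multivariate normal sampler such as `numpy.random.Generator.multivariate_normal`.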


We carried out simulations with both the true covariance matrix $\Sigma$ and its MLE $\hat{\Sigma}$ given by (24).
The estimation was of high accuracy and the performance of the feature selection and classification
procedures in both cases was similar. In what follows we present only the results obtained with
$\hat{\Sigma}$.

For each training sample we first carried out the feature selection procedure described above
with the threshold $\lambda_1$ defined in (25) and $\alpha = 0.05$. Subsequently, we used the selected features for
classifying $M_3$ vectors from the corresponding test set according to the rule (30). In the case when
it delivered a non-unique solution, we chose one of the suggested solutions at random.




                      Example 1                            Example 2
p_1    r     L = 2   L = 10   L = 20   L = 50     L = 2   L = 10   L = 20   L = 50
 10    1     1.000    .996     .975     .785      1.000    1.000    .978     .788
       2      .936    .297     .033     .000       .991    .592     .186     .000
       3      .880    .158     .006     .000       .898    .147     .003     .000
 50    1     1.000    .995     .976     .785      1.000    .995     .977     .783
       2      .975    .604     .187     .001       .979    .609     .172     .001
       3      .896    .158     .005     .000       .901    .146     .004     .000
100    1     1.000    .996     .975     .784      1.000    .996     .976     .782
       2      .976    .601     .177     .001       .981    .611     .169     .000
       3      .895    .149     .005     .000       .898    .142     .004     .000
200    1     1.000    .995     .976     .783      1.000    .995     .977     .783
       2      .975    .605     .172     .000       .980    .617     .175     .000
       3      .892    .150     .004     .000       .895    .150     .004     .000

Table 1: Average proportions of false negative features for p = 500 and various values of L, $p_1$ and
r over $M_1 \cdot M_2 = 2500$ training samples.


In all simulations we used $M_1 = M_2 = M_3 = 50$, p = 500, $\sigma = 1$ and n = 20. Note that
classification precision depends on the variance ratio $r^2 = \sigma_m^2/(\sigma^2/n)$ that may be viewed as a
signal-to-noise ratio. For this reason, we studied performance of feature selection and classification
for various combinations of $p_1$, L and r. In particular, we used $p_1 = 10, 50, 100, 200$, $L = 2, 10, 20, 50$
and several values of r depending on $p_1$.

The results of simulations indicate that for such a data generating model (somewhat different
from that analyzed in the paper), the threshold $\lambda_1$ in (25) (as well as $\lambda$ in (12) for the known
variances) might be too high, especially for small values of r. The latter led to an over-conservative
feature selection procedure. Thus, in all simulations the feature selection procedure did not detect
false positive features. The information on the proportions of false negative features (over the total
number of significant features) for several combinations of $p_1$, L and r over $M_1 \cdot M_2 = 2500$ training
samples is summarized in Table 1 for Example 1 and Example 2 (the results for Example 3 were
similar and we omit them to save space). In particular, Table 1 clearly shows that
for small values of r and small L, due to the over-conservative feature selection procedure, almost
no significant features were detected and the resulting classification is then essentially
reduced to pure random guessing. However, for any r the detection rate improves as L grows.
The improvement rate is very fast for $r \ge 2$. Thus, for L = 50 the vast majority of significant
features were detected even under quite strong noise. As we have mentioned, this affects the
classification precision since weaker significant features that remained latent in coarse classification
become active and may have a strong impact with increasing L (see below).

Figure 1: Average misclassification errors as functions of r for various combinations of $p_1$ and L
for Example 1.

For each combination of $p_1$, L and r we calculated the corresponding average misclassification
errors: see Figures 1-3 for Examples 1-3, respectively. Figures 1-3 show similar behavior for
all three examples. For any $p_1$ and L the misclassification error tends to zero as r increases. The
decay is faster for larger $p_1$: the more significant features there are, the easier the classification. The figures
also demonstrate another interesting phenomenon: for moderate and large $p_1$, the larger L, the
faster the decay. As we have argued, this is due to the fact that the impact of weaker significant
features becomes stronger with increasing L. For small r (strong noise), misclassification errors
are higher for a larger number of classes L. This is naturally explained by the failure of the feature
selection procedure to detect significant features in this case (see comments above), so that the
resulting classification is similar to a random guess with a misclassification error $1 - 1/L$ (see
Figures 1-3). However, as r increases, even the first few detected significant features strongly
improve classification precision.

Figure 2: Average misclassification errors as functions of r for various combinations of $p_1$ and L
for Example 2.

Figure 3: Average misclassification errors as functions of r for various combinations of $p_1$ and L
for Example 3.

5.2 Real-data example 

We applied the feature selection techniques discussed above to a dataset of communication signals
recorded from South American knife fishes of the genus Gymnotus. These nocturnally active
freshwater fishes generate pulsed electrostatic fields from electric organ discharges (EODs). The
three-dimensional electrostatic EOD fields of Gymnotus can be summarized by two-dimensional
head-to-tail waveforms recorded from underwater electrodes placed in front of and behind a fish.
EOD waveforms vary among species and are used by Gymnotus to recognize their own
kind for more productive mating and other purposes.

The data set consists of 512-dimensional vectors of the Symmlet-4 discrete wavelet transform
coefficients of signals obtained from eight genetically distinct species of Gymnotus (G. arapaima
(G1), G. coatesi (G2), G. coropinae (G3), G. curupira (G4), G. jonasi (G5), G. mamiraua (G6),
G. obscurus (G7), G. varzea (G8)) at various stages of their development. In particular, species
were divided into six ontogenetic categories: postlarval (J0), small juvenile (J1), large juvenile (J2),
immature adult (IA), mature male (M) and mature female (F). The EODs were recorded from 42 of
48 possible combinations of eight species and six categories. There are 677 samples from 42 classes 
with sizes varying from 3 to 69. The complete description of the data can be found in Crampton 
et al. (2011). 

As is evident from Crampton et al. (2011), there is no expectation that these groups should 
all be mutually separable: there is considerable overlap between developmental stages of the same 
species as well as among juveniles of different species. For this reason, we reduced the number of 
classes to include only those species/categories that might potentially be separated. In particular, 
we ran our feature selection and classification procedure on data sets comprising 10 to 
16 classes, listed in the order in which they appear: G2-M, G4-M, G5-M, G1-F, G2-F, G5-F, G7-F, G8-F, 
G2-J1, G4-J1, G2-F, G1-J1, G7-IA, G1-F, G6-M, G7-J1. 

We split the respective data sets into training and test parts. For this purpose, in each class 
we chose at random at most 1/3 of the total number of observations for the test part, leaving the rest 
of the data as training samples. Using those training samples, we carried out feature selection and 
subsequent classification of vectors in the test part of the data set. We repeated the process 100 
times for various splits and recorded the average misclassification errors and their standard errors 
for each of the cases (L = 10, 11, ..., 16). Table 2 presents the results of the study: the average sample 
sizes of the training ($N_{train}$) and test ($N_{test}$) sets for each L, the average number of selected significant 
features ($\hat{p}_1$) and the average misclassification error with the corresponding standard error. 
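
The splitting protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in the study: the function names `split_train_test` and `repeated_split_error` are hypothetical, "at most 1/3" is implemented by integer division of each class size by 3, and `classify` stands for any user-supplied routine (for instance, feature selection followed by classification) that returns predicted labels for the test indices.

```python
import random
from collections import defaultdict


def split_train_test(labels, rng):
    """Per class, move at most one third of the observations to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is untouched
        rng.shuffle(members)
        n_test = len(members) // 3    # floor, hence "at most 1/3"
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def repeated_split_error(labels, classify, n_repeats=100, seed=0):
    """Average test error and its standard error over random splits.

    classify(train_idx, test_idx) is a hypothetical callback that returns
    one predicted label per test index. Assumes every split has a
    non-empty test set (true when some class has at least 3 samples).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train_idx, test_idx = split_train_test(labels, rng)
        predicted = classify(train_idx, test_idx)
        wrong = sum(p != labels[i] for p, i in zip(predicted, test_idx))
        errors.append(wrong / len(test_idx))
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / (len(errors) - 1)
    return mean, (var / len(errors)) ** 0.5
```

With class sizes between 3 and 69, as in the Gymnotus data, every class contributes at least one test observation and at least two training observations under this rule.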

The table shows that when one starts with 10 well separated classes, the misclassification error 
initially grows as L increases from 10 to 13. However, past L = 13 there is a strong jump in the 
number of detected features (from 67.6 at L = 13 to 83.7 at L = 14), and the misclassification error 
decreases as L grows from 13 to 15 due to better feature selection. For L > 15 the misclassification 
error grows again with L due to poor separation of juvenile Gymnotus EOD waveform shapes. 
 L    $N_{train}$    $N_{test}$    $\hat{p}_1$    Misclassification error
10        32             10           67.0           .077 (.006)
11        38             13           68.3           .092 (.006)
12        46             16           65.3           .127 (.007)
13        51             18           67.6           .166 (.007)
14        57             20           83.7           .149 (.006)
15        64             23           87.4           .130 (.006)
16        68             24           86.8           .162 (.007)

Table 2: The sample sizes of the training ($N_{train}$) and test ($N_{test}$) sets, the numbers of selected significant 
features ($\hat{p}_1$) and misclassification errors with standard errors in brackets, averaged over 100 splits 
for the Gymnotus fish data. 
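
The pattern in Table 2 can be read off mechanically. The short sketch below only re-encodes the table's numbers; the names `TABLE2`, `error_trend` and `largest_feature_jump` are illustrative and not part of the original study.

```python
# Rows of Table 2: (L, N_train, N_test, selected features, error, standard error)
TABLE2 = [
    (10, 32, 10, 67.0, 0.077, 0.006),
    (11, 38, 13, 68.3, 0.092, 0.006),
    (12, 46, 16, 65.3, 0.127, 0.007),
    (13, 51, 18, 67.6, 0.166, 0.007),
    (14, 57, 20, 83.7, 0.149, 0.006),
    (15, 64, 23, 87.4, 0.130, 0.006),
    (16, 68, 24, 86.8, 0.162, 0.007),
]


def error_trend(rows):
    """Label each step L -> L + 1 by the direction of the error change."""
    return [
        (a[0], b[0], "up" if b[4] > a[4] else "down")
        for a, b in zip(rows, rows[1:])
    ]


def largest_feature_jump(rows):
    """Return (L_from, L_to, increase) for the largest rise in selected features."""
    steps = ((a[0], b[0], b[3] - a[3]) for a, b in zip(rows, rows[1:]))
    return max(steps, key=lambda step: step[2])
```

The error rises on every step from L = 10 to L = 13, falls on the two steps from 13 to 15, and rises again from 15 to 16; the largest increase in the number of selected features occurs between L = 13 and L = 14.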

6 Concluding remarks 

The paper considers multi-class classification of high-dimensional normal vectors, where unlike 
standard assumptions, the number of classes may also be large. We propose a consistent feature 
selection procedure and derive the misclassification error of the resulting classification procedure. 

In particular, our results indicate an interesting phenomenon: the precision of classification 
can improve as the number of classes grows. This is, at first glance, a completely counter-intuitive 
conclusion, and it has not been observed so far due to the shortage of literature on multi-class classification. 
It is explained by the fact that even weak significant features, which might be undetected for smaller 
L, strongly contribute to successful classification when the number of classes is large. 
7 Appendix 

We start by recalling two lemmas of Birgé (2001) that will be used further in the proofs. 

Lemma 1 (Lemma 8.1 of Birgé, 2001). Let $\zeta \sim \chi^2_{k;\mu}$, $\mu \ge 0$. Then, for any $x > 0$, 
\[
P\left(\zeta \ge \mu + k + 2\sqrt{(k + 2\mu)x} + 2x\right) \le e^{-x} \tag{34}
\]
and 
\[
P\left(\zeta \le \mu + k - 2\sqrt{(k + 2\mu)x}\right) \le e^{-x}. \tag{35}
\]
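
The two tail bounds of Lemma 1 are easy to probe numerically. The following Monte Carlo sketch is an illustration only and not part of the original argument: the function names are hypothetical, and a non-central chi-square draw is generated by placing the whole non-centrality $\mu$ in the first Gaussian coordinate, which is one standard construction.

```python
import math
import random


def noncentral_chi2(k, mu, rng):
    """One draw from chi-square with k degrees of freedom and non-centrality mu."""
    first = (rng.gauss(0.0, 1.0) + math.sqrt(mu)) ** 2
    rest = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k - 1))
    return first + rest


def upper_threshold(k, mu, x):
    """Right-hand threshold in (34)."""
    return mu + k + 2.0 * math.sqrt((k + 2.0 * mu) * x) + 2.0 * x


def lower_threshold(k, mu, x):
    """Left-hand threshold in (35)."""
    return mu + k - 2.0 * math.sqrt((k + 2.0 * mu) * x)


def empirical_tails(k, mu, x, n_draws=5000, seed=1):
    """Empirical frequencies of the two tail events and the bound exp(-x)."""
    rng = random.Random(seed)
    draws = [noncentral_chi2(k, mu, rng) for _ in range(n_draws)]
    upper = sum(d >= upper_threshold(k, mu, x) for d in draws) / n_draws
    lower = sum(d <= lower_threshold(k, mu, x) for d in draws) / n_draws
    return upper, lower, math.exp(-x)
```

For moderate values such as k = 5, $\mu$ = 3 and x = 1, both empirical frequencies should fall well below $e^{-x}$, which reflects that the bounds are valid but not tight.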

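Lemma 2 (Lemma 8.2 of Birgé, 2001). Let $X$ be a random variable such that 
\[
\log E\left[e^{sX}\right] \le \frac{(as)^2}{1 - bs} \quad \text{for } 0 < s < b^{-1},
\]
where $a$ and $b$ are positive constants. Then 
\[
P\left[X \ge 2a\sqrt{x} + bx\right] \le e^{-x} \quad \text{for all } x > 0.
\]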


Proof of Theorem 1 

Proof. Note that 
\[
P(\hat{l} \ne l) = \sum_{k \ne l} P(\hat{l} = k) \le L_1 \max_{k \ne l} P(\hat{l} = k). \tag{36}
\]


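For a given $k \ne l$ define a $(2p_1)$-dimensional random vector 
\[
Y = \begin{pmatrix} Y_0^* - \bar{Y}_l^* \\ Y_0^* - \bar{Y}_k^* \end{pmatrix},
\]
where the vectors $Y_0^*$, $\bar{Y}_l^*$ and $\bar{Y}_k^*$ are defined in (6). A straightforward calculus yields 
$Y \sim N(\theta, V)$ with 
\[
\theta = \begin{pmatrix} 0_{p_1} \\ m_l^* - m_k^* \end{pmatrix}, \qquad
V = \begin{pmatrix} \rho_l^{-1}\Sigma^* & \Sigma^* \\ \Sigma^* & \rho_k^{-1}\Sigma^* \end{pmatrix}, \tag{37}
\]
where $\rho_l$ is defined in (3). Then, it follows from (4) that 
\[
P(\hat{l} = k) \le P\left( \rho_l (Y_0^* - \bar{Y}_l^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_l^*) \ge \rho_k (Y_0^* - \bar{Y}_k^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_k^*) \right) = P(Y^T A Y \ge 0),
\]
where 
\[
A = \begin{pmatrix} \rho_l (\Sigma^*)^{-1} & 0_{p_1 \times p_1} \\ 0_{p_1 \times p_1} & -\rho_k (\Sigma^*)^{-1} \end{pmatrix}.
\]
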

Consider a random variable $\xi = Y^T A Y$. Since $V$ is a symmetric positive-definite matrix and 
$A$ is symmetric, they can be simultaneously diagonalized, that is, there exists a matrix $W$ such that 
$W^T V^{-1} W = I$ and $W^T A W = \Lambda$, where $\Lambda$ is a diagonal matrix of the eigenvalues $\varphi_j$, $j = 1, \ldots, 2p_1$, 
of $R = VA$. Then, from the known results on the distribution of quadratic forms of normal variables 
(e.g., Imhof, 1961), $\xi$ can be represented as a weighted sum of independent (generally) non-central 
chi-square variables: 
\[
\xi = \sum_{j=1}^{2p_1} \varphi_j \, \chi^2_{1;\eta_j^2}, \tag{38}
\]
where $\eta$ is such that $\theta = W\eta$ with $\theta$ given by (37). By a straightforward matrix calculus, we obtain 
\[
R^2 = \begin{pmatrix} (1 - \rho_k\rho_l) I_{p_1} & 0_{p_1 \times p_1} \\ 0_{p_1 \times p_1} & (1 - \rho_k\rho_l) I_{p_1} \end{pmatrix}
\]
and, therefore, all eigenvalues $\varphi_j$, $j = 1, \ldots, 2p_1$, of the matrix $R = VA$ are of the form 
\[
\varphi_j = \pm\varphi_*, \quad \text{where } \varphi_* = \sqrt{1 - \rho_k\rho_l}, \quad j = 1, \ldots, 2p_1. \tag{39}
\]
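
The matrix calculus behind (39) can be written out as a short worked step. It is a sketch under the assumption that $V$ has the block form with diagonal blocks $\rho_l^{-1}\Sigma^*$, $\rho_k^{-1}\Sigma^*$ and off-diagonal blocks $\Sigma^*$, and that $A$ is block-diagonal with blocks $\rho_l(\Sigma^*)^{-1}$ and $-\rho_k(\Sigma^*)^{-1}$.

```latex
\[
R = VA =
\begin{pmatrix} \rho_l^{-1}\Sigma^* & \Sigma^* \\ \Sigma^* & \rho_k^{-1}\Sigma^* \end{pmatrix}
\begin{pmatrix} \rho_l(\Sigma^*)^{-1} & 0 \\ 0 & -\rho_k(\Sigma^*)^{-1} \end{pmatrix}
=
\begin{pmatrix} I_{p_1} & -\rho_k I_{p_1} \\ \rho_l I_{p_1} & -I_{p_1} \end{pmatrix},
\]
\[
R^2 =
\begin{pmatrix} (1-\rho_k\rho_l) I_{p_1} & (-\rho_k+\rho_k) I_{p_1} \\ (\rho_l-\rho_l) I_{p_1} & (1-\rho_k\rho_l) I_{p_1} \end{pmatrix}
= (1-\rho_k\rho_l)\, I_{2p_1}.
\]
```

Since $R^2 = \varphi_*^2 I_{2p_1}$, every eigenvalue of $R$ equals $\pm\varphi_*$, and since the trace of $R$ is zero, each sign occurs exactly $p_1$ times. In particular, the eigenvalues sum to zero.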

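Consider now the logarithm of the moment generating function of the centered random variable 
$\xi - E\xi$, where $\xi$ is defined in (38). We have $E\xi = \sum_{j=1}^{2p_1} \varphi_j (1 + \eta_j^2) = \sum_{j=1}^{2p_1} \varphi_j \eta_j^2$, where recall 
that $W\eta = \theta$. Hence, using formula (39), for $0 < s < 1/(2\varphi_*)$, we derive 
\begin{align*}
\ln E e^{s(\xi - E\xi)}
&= \sum_{j=1}^{2p_1} \frac{\eta_j^2 \varphi_j s}{1 - 2\varphi_j s} - \frac{1}{2} \sum_{j=1}^{2p_1} \ln(1 - 2\varphi_j s) - s \sum_{j=1}^{2p_1} \varphi_j (1 + \eta_j^2) \\
&= \sum_{j=1}^{2p_1} \left( \frac{\eta_j^2 \varphi_j s}{1 - 2\varphi_j s} - \eta_j^2 \varphi_j s \right) - \frac{1}{2} \sum_{j=1}^{2p_1} \left( \ln(1 - 2\varphi_j s) + 2\varphi_j s \right) \\
&\le \sum_{j=1}^{2p_1} \frac{2 s^2 \varphi_*^2 \eta_j^2}{1 - 2\varphi_* s} + \sum_{j=1}^{2p_1} \frac{s^2 \varphi_*^2}{1 - 2\varphi_* s} \\
&= \frac{2 s^2 \varphi_*^2 \|\eta\|^2}{1 - 2\varphi_* s} + \frac{2 s^2 \varphi_*^2 p_1}{1 - 2\varphi_* s}.
\end{align*}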
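
The inequality in the bound on the moment generating function rests on two elementary facts, spelled out here as a worked step. This is a sketch of one way to justify it, valid for $0 < s < 1/(2\varphi_*)$ and $|\varphi_j| = \varphi_*$; the source may have argued slightly differently.

```latex
% (i) The terms containing the non-centrality parameters:
\[
\frac{\eta_j^2 \varphi_j s}{1 - 2\varphi_j s} - \eta_j^2 \varphi_j s
= \frac{2 s^2 \varphi_j^2 \eta_j^2}{1 - 2\varphi_j s}
\le \frac{2 s^2 \varphi_*^2 \eta_j^2}{1 - 2\varphi_* s},
\qquad \text{since } 1 - 2\varphi_j s \ge 1 - 2\varphi_* s > 0 .
\]
% (ii) The logarithmic terms, with u = 2 \varphi_j s and |u| < 1:
\[
-\tfrac{1}{2}\bigl(\ln(1 - u) + u\bigr)
\le \frac{u^2}{4(1 - |u|)}
= \frac{s^2 \varphi_*^2}{1 - 2\varphi_* s} .
\]
% For u > 0 this follows from -\ln(1-u) - u = \sum_{m \ge 2} u^m/m \le u^2 / (2(1-u));
% for u < 0 from |u| - \ln(1 + |u|) \le u^2/2 .
```

Summing (i) and (ii) over $j = 1, \ldots, 2p_1$ gives the two terms of the final bound.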


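Denote 
\[
\Delta^2 = (m_l^* - m_k^*)^T (\Sigma^*)^{-1} (m_l^* - m_k^*).
\]
Using $W^T V^{-1} W = I$, $W^T A W = \Lambda$ and $W\eta = \theta$, one can verify that $\varphi_*^2 \|\eta\|^2 = \eta^T \Lambda^2 \eta = \theta^T A V A \theta = 
\rho_k \Delta^2$, where $\theta$ and $V$ are defined in (37). Thus, 
\[
\ln E e^{s(\xi - E\xi)} \le \frac{a^2 s^2}{1 - bs},
\]
where $b = 2\varphi_*$ and 
\[
a = \sqrt{2\rho_k \Delta^2 + 2\varphi_*^2 p_1} \le \sqrt{2}\left( \sqrt{\rho_k}\, |\Delta| + \varphi_* \sqrt{p_1} \right).
\]
In addition, 
\[
E\xi = \eta^T \Lambda \eta = \theta^T A \theta = -\rho_k \Delta^2.
\]
A straightforward calculus shows that, under the condition (7) of Theorem 1, one has $\rho_k \Delta^2 \ge 
2a\sqrt{\ln(L_1/\alpha)} + b \ln(L_1/\alpha)$. Then, applying Lemma 2 to $\xi - E\xi$, one obtains 
\[
P(\xi \ge 0) \le P\left( \xi \ge -\rho_k \Delta^2 + 2a\sqrt{\ln(L_1/\alpha)} + b \ln(L_1/\alpha) \right) \le \frac{\alpha}{L_1},
\]
which, together with (36), completes the proof. 

□ 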


Proof of Theorem 3 

Proof. Let $\hat{p}_{01} = \sum_{j=1}^{p} I\{\hat{x}_j = 1 \mid x_j = 0\}$ and $\hat{p}_{11} = \sum_{j=1}^{p} I\{\hat{x}_j = 1 \mid x_j = 1\}$ be the numbers of 
erroneously and truly identified significant features respectively, where obviously $\hat{p}_{01}$ and $\hat{p}_{11}$ are 
independent, and $\hat{p}_{01} + \hat{p}_{11} = \hat{p}_1$. Note that 
\[
P(\hat{x} \ne x) \le P(\hat{p}_{01} > 0) + P(\hat{p}_{11} < p_1).
\]
Recall that for $x_j = 0$, the corresponding $\zeta_j \sim \chi^2_{L_1}$. Let $u_j$, $j = 1, \ldots, p_0$, be any, possibly 
correlated, $\chi^2_{L_1}$ random variables. Then, 
\[
P(\hat{p}_{01} > 0) = P\left( \max_{1 \le j \le p_0} u_j > \lambda \right) \le p\, P\left( u_j > L_1 + 2\sqrt{L_1 \ln(2p/\alpha)} + 2\ln(2p/\alpha) \right).
\]
Apply Lemma 1 for the particular case $\mu = 0$ to obtain 
\[
P\left( u_j > L_1 + 2\sqrt{L_1 \ln(2p/\alpha)} + 2\ln(2p/\alpha) \right) \le \frac{\alpha}{2p},
\]
so that $P(\hat{p}_{01} > 0) \le \alpha/2$. 

Similarly, let $\mu_* = \min_{1 \le j \le p_1} \mu_j = \min_{1 \le j \le p_1} \sigma_j^{-2} \sum_{l=1}^{L} n_l \beta_{lj}^2$ and consider any, possibly 
correlated, non-central chi-squared variables $v_j \sim \chi^2_{L_1;\mu_j}$, $j = 1, \ldots, p_1$. We have 
\[
P(\hat{p}_{11} < p_1) \le P\left( \min_{1 \le j \le p_1} v_j < \lambda \right) \le p\, P(v_j < \lambda).
\]
A straightforward calculus shows that, under the condition (15) on $\mu_*$, one has $\mu_* + L_1 - 
2\sqrt{(L_1 + 2\mu_*) \ln(2p/\alpha)} \ge \lambda$. Thus, Lemma 1 yields $P(v_j < \lambda) \le \alpha/(2p)$ and, therefore, 
$P(\hat{p}_{11} < p_1) \le \alpha/2$, which completes the proof. 

□ 
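
The first half of the argument can be illustrated numerically. The sketch below is not the authors' procedure: it assumes that the selection threshold equals the quantity $L_1 + 2\sqrt{L_1\ln(2p/\alpha)} + 2\ln(2p/\alpha)$ appearing in the proof, it simulates independent null statistics (the proof allows arbitrary dependence), and the function names are hypothetical.

```python
import math
import random


def selection_threshold(L1, p, alpha):
    """Threshold L1 + 2*sqrt(L1*ln(2p/alpha)) + 2*ln(2p/alpha) from the proof."""
    t = math.log(2.0 * p / alpha)
    return L1 + 2.0 * math.sqrt(L1 * t) + 2.0 * t


def false_selection_rate(L1, p0, p, alpha, n_runs=300, seed=2):
    """Fraction of runs in which some null chi-square(L1) statistic exceeds the threshold.

    The p0 null statistics are simulated as independent here, which is an
    assumption of this sketch only.
    """
    rng = random.Random(seed)
    lam = selection_threshold(L1, p, alpha)
    hits = 0
    for _ in range(n_runs):
        exceeded = any(
            sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(L1)) > lam
            for _ in range(p0)
        )
        hits += exceeded
    return hits / n_runs
```

For example, with `L1 = 4`, `p0 = p = 50` and `alpha = 0.1`, the simulated rate should stay below the guaranteed level $\alpha/2$, typically by a wide margin because the union bound and Lemma 1 are both conservative.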


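Proof of Theorem 5 

Proof. We start with the following lemma: 

Lemma 3. 
\[
P\left( \max_{1 \le j \le p} \left| \frac{\hat{\sigma}_j^2}{\sigma_j^2} - 1 \right| \ge \kappa_{p,N,L,\alpha} \right) \le \alpha,
\]
where $\kappa_{p,N,L,\alpha}$ was defined in (26). 

Let $A$ be the event $\max_{1 \le j \le p} |\hat{\sigma}_j^2/\sigma_j^2 - 1| < \kappa_{p,N,L,\alpha}$ and $I_A$ its indicator. By Lemma 3, 
\[
P(\hat{x} \ne x) \le P((\hat{x} \ne x) I_A) + \alpha, \tag{40}
\]
where 
\[
P((\hat{x} \ne x) I_A) \le P((\hat{p}_{01} > 0) I_A) + P((\hat{p}_{11} < p_1) I_A). \tag{41}
\]
Let $\hat{\zeta}_j = \hat{\sigma}_j^{-2} \sum_{l=1}^{L} n_l (\bar{Y}_{lj} - \bar{Y}_j)^2$. Then, on the event $A$, 
\[
P\left( (\hat{\zeta}_j > \lambda_1) I_A \mid x_j = 0 \right) = P\left( \left( u_j > \lambda_1 \frac{\hat{\sigma}_j^2}{\sigma_j^2} \right) I_A \right) \le P(u_j > \lambda),
\]
where $u_j \sim \chi^2_{L_1}$, $j = 1, \ldots, p_0$. Hence, following the arguments of Theorem 3, by Lemma 1, 
\[
P((\hat{p}_{01} > 0) I_A) \le P\left( \left( \max_{1 \le j \le p} \hat{\zeta}_j > \lambda_1 \right) I_A \mid x_j = 0 \right) \le P\left( \max_{1 \le j \le p_0} u_j > \lambda \right) \le \frac{\alpha}{2}. \tag{42}
\]
Similarly, 
\[
P\left( (\hat{\zeta}_j < \lambda_1) I_A \mid x_j = 1 \right) \le P\left( v_j < \lambda_1 (1 + \kappa_{p,N,L,\alpha}) \right),
\]
where $v_j \sim \chi^2_{L_1;\mu_j}$, $j = 1, \ldots, p_1$. Then, under the condition (15) of the theorem, Lemma 1 yields 
\[
P((\hat{p}_{11} < p_1) I_A) \le P\left( \min_{1 \le j \le p_1} v_j < \lambda_1 (1 + \kappa_{p,N,L,\alpha}) \right) \le \frac{\alpha}{2}. \tag{43}
\]
Combination of (40)-(43) completes the proof. 

□ 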


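Proof of Theorem 6 

Proof. Let $Y_0$ be from the $l$-th class. From (18) we have 
\[
P(\hat{l} \ne l) \le P(\hat{l} \ne l \mid \hat{x} = x) + P(\hat{x} \ne x),
\]
where $P(\hat{x} \ne x) \le 2\alpha$ by Theorem 5. Consider a set $\Omega = \{\omega : \hat{x} = x\}$ with $P(\Omega) \ge 1 - 2\alpha$. In order 
to bound $P(\hat{l} \ne l \mid \hat{x} = x)$ from above, we assume that $\omega \in \Omega$. We will use the following two lemmas: 

Lemma 4. If $\|\hat{\Sigma}^* - \Sigma^*\| \le \lambda_{\min}(\Sigma^*)/2$, then 
\[
\|(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}\| \le 2\lambda_{\min}^{-2}(\Sigma^*)\, \|\hat{\Sigma}^* - \Sigma^*\|.
\]

Lemma 5. Under the condition (31), 
\[
P\left( \|\hat{\Sigma}^* - \Sigma^*\| \le \lambda_{\max}(\Sigma^*)\, C_1 \sqrt{\frac{p_1}{N}} \right) \ge 1 - 2\alpha.
\]

From Lemma 4 and Lemma 5 it follows that under (31), 
\[
P\left( \|(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}\| \le \gamma_{p_1,N} \right) \ge 1 - 2\alpha, \tag{44}
\]
where $\gamma_{p_1,N}$ is defined in (32). Furthermore, for any $1 \le k \le L$, 
\[
\frac{\left| (Y_0^* - \bar{Y}_k^*)^T \left( (\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1} \right) (Y_0^* - \bar{Y}_k^*) \right|}{(Y_0^* - \bar{Y}_k^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_k^*)}
\le \left\| \Sigma^* \left( (\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1} \right) \right\|
\le \lambda_{\max}(\Sigma^*)\, \|(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}\|. \tag{45}
\]
Therefore, (44) and (45) imply that with probability at least $1 - 2\alpha$ 
\begin{align*}
& \rho_l (Y_0^* - \bar{Y}_l^*)^T (\hat{\Sigma}^*)^{-1} (Y_0^* - \bar{Y}_l^*) - \rho_k (Y_0^* - \bar{Y}_k^*)^T (\hat{\Sigma}^*)^{-1} (Y_0^* - \bar{Y}_k^*) \\
&\quad = \rho_l (Y_0^* - \bar{Y}_l^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_l^*) - \rho_k (Y_0^* - \bar{Y}_k^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_k^*) \\
&\qquad + \rho_l (Y_0^* - \bar{Y}_l^*)^T \left( (\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1} \right) (Y_0^* - \bar{Y}_l^*) - \rho_k (Y_0^* - \bar{Y}_k^*)^T \left( (\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1} \right) (Y_0^* - \bar{Y}_k^*) \\
&\quad \le \rho_l (1 + \gamma_{p_1,N}) (Y_0^* - \bar{Y}_l^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_l^*) - \rho_k (1 - \gamma_{p_1,N}) (Y_0^* - \bar{Y}_k^*)^T (\Sigma^*)^{-1} (Y_0^* - \bar{Y}_k^*).
\end{align*}
Define $\rho_l' = \rho_l (1 + \gamma_{p_1,N})$ and $\rho_k' = \rho_k (1 - \gamma_{p_1,N})$. In particular, note that $\rho_l' \rho_k' = \rho_l \rho_k (1 - \gamma_{p_1,N}^2)$. 

Repeating the proof of Theorem 1 but with $\rho_l'$ and $\rho_k'$ and under the stronger condition (33), we obtain 
$P(\hat{l} \ne l \mid \hat{x} = x) \le 2\alpha$ which, together with (18) and $P(\hat{x} \ne x) \le 2\alpha$, completes the proof. □ 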
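Proof of Lemma 3 

Proof. Note that $\hat{\sigma}_j^2/\sigma_j^2$ is proportional to a $\chi^2_{N-L}$ random variable, and apply Lemma 1 to get 
\[
P\left( \left| \frac{\hat{\sigma}_j^2}{\sigma_j^2} - 1 \right| \ge \kappa_{p,N,L,\alpha} \right) \le \frac{\alpha}{p}
\]
for all $j = 1, \ldots, p$ and, therefore, 
\[
P\left( \max_{1 \le j \le p} \left| \frac{\hat{\sigma}_j^2}{\sigma_j^2} - 1 \right| \ge \kappa_{p,N,L,\alpha} \right) \le \alpha.
\]

□ 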


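Proof of Lemma 4 

Proof. Under the condition of the lemma we have 
\[
\|(\hat{\Sigma}^*)^{-1}\|^{-1} = \min_{\|a\|=1} a^T \hat{\Sigma}^* a \ge \min_{\|a\|=1} a^T \Sigma^* a - \max_{\|a\|=1} a^T (\Sigma^* - \hat{\Sigma}^*) a \ge \lambda_{\min}(\Sigma^*)/2
\]
and, therefore, 
\[
\|(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}\| \le \|(\hat{\Sigma}^*)^{-1}\| \cdot \|\hat{\Sigma}^* - \Sigma^*\| \cdot \|(\Sigma^*)^{-1}\| \le 2\lambda_{\min}^{-2}(\Sigma^*)\, \|\hat{\Sigma}^* - \Sigma^*\|.
\]

□ 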
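
The last inequality in the proof of Lemma 4 uses a standard identity for the difference of two inverses, which is written out below as a worked step (a sketch of the usual argument, with $\|\cdot\|$ the spectral norm).

```latex
\[
(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}
= (\hat{\Sigma}^*)^{-1} \left( \Sigma^* - \hat{\Sigma}^* \right) (\Sigma^*)^{-1},
\]
\[
\|(\hat{\Sigma}^*)^{-1}\| \le \frac{2}{\lambda_{\min}(\Sigma^*)},
\qquad
\|(\Sigma^*)^{-1}\| = \frac{1}{\lambda_{\min}(\Sigma^*)},
\]
\[
\text{so that}\quad
\|(\hat{\Sigma}^*)^{-1} - (\Sigma^*)^{-1}\|
\le \frac{2}{\lambda_{\min}^{2}(\Sigma^*)}\, \|\hat{\Sigma}^* - \Sigma^*\| .
\]
```

The identity is verified by multiplying out the right-hand side; the bound then follows from submultiplicativity of the norm.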


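Proof of Lemma 5 

Proof. Define $Z_{il} = Y_{il}^* - m_l^* \sim N(0_{p_1}, \Sigma^*)$, $i = 1, \ldots, n_l$, $l = 1, \ldots, L$. The sample covariance 
matrix is translation invariant and, therefore, 
\[
\hat{\Sigma}^* = \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{n_l} (Z_{il} - \bar{Z}_l)(Z_{il} - \bar{Z}_l)^T
= \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{n_l} Z_{il} Z_{il}^T - \frac{1}{N} \sum_{l=1}^{L} n_l \bar{Z}_l \bar{Z}_l^T
= S_1 - S_2.
\]
Thus, 
\[
\|\hat{\Sigma}^* - \Sigma^*\| \le \|S_1 - \Sigma^*\| + \|S_2\|. \tag{46}
\]
By Remark 5.51 of Vershynin (2012), under the conditions of the lemma there exists an absolute 
constant $C_0$ such that 
\[
P\left( \|S_1 - \Sigma^*\| \le \lambda_{\max}(\Sigma^*)\, C_0 \sqrt{\frac{p_1}{N}} \right) \ge 1 - \alpha. \tag{47}
\]
Consider now $S_2$. Define the $p_1 \times L$-dimensional matrix $Z$ with columns $\bar{Z}_l$, $l = 1, \ldots, L$, and the 
diagonal matrix $D = \mathrm{diag}(\sqrt{n_1}, \ldots, \sqrt{n_L})$. It is easy to see that $S_2 = N^{-1} (ZD)(ZD)^T$ and that the 
matrix $\Xi = (\Sigma^*)^{-1/2} Z D$ has i.i.d. $N(0, 1)$ entries. Indeed, the columns $\Xi_l = \sqrt{n_l}\, (\Sigma^*)^{-1/2} \bar{Z}_l$ of the matrix 
$\Xi$ are independent with $\mathrm{Cov}(\Xi_l) = I_{p_1}$. Hence, 
\[
\|S_2\| = N^{-1} \|ZD\|^2 = N^{-1} \|\sqrt{\Sigma^*}\, \Xi\|^2 \le N^{-1} \lambda_{\max}(\Sigma^*)\, \|\Xi\|^2.
\]
Then, by Corollary 5.35 of Vershynin (2012), 
\[
P\left( \|S_2\| \le N^{-1} \lambda_{\max}(\Sigma^*) \left( \sqrt{p_1} + \sqrt{L} + \sqrt{2\ln(2/\alpha)} \right)^2 \right) \ge 1 - \alpha
\]
which, under (31), yields 
\[
P\left( \|S_2\| \le 9 \lambda_{\max}(\Sigma^*)\, N^{-1} p_1 \right) \ge 1 - \alpha. \tag{48}
\]
Combination of (46)-(48) completes the proof with $C_1 = \max(C_0, 9)$. 

□ 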
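
The passage from the bound of Corollary 5.35 to the constant 9 in (48) can be made explicit. The step below is a sketch under the assumption that condition (31), which is not restated in this appendix, guarantees at least $L \le p_1$ and $2\ln(2/\alpha) \le p_1$; these two inequalities are one sufficient set of conditions and may differ from the exact form of (31).

```latex
\[
L \le p_1 \ \text{ and } \ 2\ln(2/\alpha) \le p_1
\quad\Longrightarrow\quad
\left( \sqrt{p_1} + \sqrt{L} + \sqrt{2\ln(2/\alpha)} \right)^2
\le \left( 3\sqrt{p_1} \right)^2 = 9 p_1 .
\]
```

Here the three terms correspond to the two dimensions of the $p_1 \times L$ Gaussian matrix and to the deviation parameter $t = \sqrt{2\ln(2/\alpha)}$, for which $2e^{-t^2/2} = \alpha$.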